Utilizando a base descrita e disponibilizada em aula o objetivo do trabalho é mensurar a variável “Quality” dos vinhos desta região de Portugal com as variáveis de características (composição) dos vinhos.
Etapa 1 (Base)
Etapa 2 (Algoritmos explicar variável Quality)
Etapa 3 (Algoritmos explicar variável “Quality”: Vinhos bons e ruins)
Etapa 4 (Análise sobre outras possíveis técnicas)
require(DT)
## Loading required package: DT
require(plotly)
## Loading required package: plotly
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(corrgram)
## Loading required package: corrgram
require(outliers)
## Loading required package: outliers
require(caTools) # data splitting
## Loading required package: caTools
require(dplyr) # data wrangling
require(rpart) # performing regression trees
## Loading required package: rpart
require(rpart.plot) # plotting regression trees
## Loading required package: rpart.plot
require(ipred) # bagging
## Loading required package: ipred
require(caret) # bagging
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
require(pROC)
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
Vinhos <- read.csv2("BaseWine_Red_e_White2018.csv", row.names=1)
attach(Vinhos)
head(Vinhos)
## fixedacidity volatileacidity citricacid residualsugar chlorides
## 1 6.6 0.24 0.35 7.70 0.031
## 2 6.7 0.34 0.43 1.60 0.041
## 3 10.6 0.31 0.49 2.20 0.063
## 4 5.4 0.18 0.24 4.80 0.041
## 5 6.7 0.30 0.44 18.75 0.057
## 6 6.8 0.50 0.11 1.50 0.075
## freesulfurdioxide totalsulfurdioxide density pH sulphates alcohol
## 1 36 135 0.99380 3.19 0.37 10.5
## 2 29 114 0.99014 3.23 0.44 12.6
## 3 18 40 0.99760 3.14 0.51 9.8
## 4 30 113 0.99445 3.42 0.40 9.4
## 5 65 224 0.99956 3.11 0.53 9.1
## 6 16 49 0.99545 3.36 0.79 9.5
## quality Vinho
## 1 5 WHITE
## 2 6 WHITE
## 3 6 RED
## 4 6 WHITE
## 5 5 WHITE
## 6 5 RED
Temos um total de 13 colunas, com duas possíveis variáveis targets a variável quality e Vinho. A coluna Vinho já está corretamente indicada como sendo categórica, e podemos fazer a mesma coisa para quality mais pra frente se acharmos necessário.
sapply(Vinhos, function(x) sum(is.na(x)))
## fixedacidity volatileacidity citricacid
## 0 0 0
## residualsugar chlorides freesulfurdioxide
## 0 0 0
## totalsulfurdioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
## Vinho
## 0
Conseguimos observar que não temos nenhum valor faltante nesse dataset, eliminando a necessidade de tratar esses valores.
summary(Vinhos)
## fixedacidity volatileacidity citricacid residualsugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.60
## 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500 1st Qu.: 1.80
## Median : 7.000 Median :0.2900 Median :0.3100 Median : 3.00
## Mean : 7.215 Mean :0.3397 Mean :0.3186 Mean : 5.44
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900 3rd Qu.: 8.10
## Max. :15.900 Max. :1.5800 Max. :1.6600 Max. :45.80
## chlorides freesulfurdioxide totalsulfurdioxide density
## Min. :0.00900 Min. : 1.00 Min. : 6.0 Min. :0.9871
## 1st Qu.:0.03800 1st Qu.: 17.00 1st Qu.: 77.0 1st Qu.:0.9923
## Median :0.04700 Median : 29.00 Median :118.0 Median :0.9949
## Mean :0.05603 Mean : 30.53 Mean :115.7 Mean :0.9947
## 3rd Qu.:0.06500 3rd Qu.: 41.00 3rd Qu.:156.0 3rd Qu.:0.9970
## Max. :0.61100 Max. :289.00 Max. :440.0 Max. :1.0140
## pH sulphates alcohol quality
## Min. :2.720 Min. :0.2200 Min. : 0.9567 Min. :3.000
## 1st Qu.:3.110 1st Qu.:0.4300 1st Qu.: 9.5000 1st Qu.:5.000
## Median :3.210 Median :0.5100 Median :10.3000 Median :6.000
## Mean :3.219 Mean :0.5313 Mean :10.4862 Mean :5.818
## 3rd Qu.:3.320 3rd Qu.:0.6000 3rd Qu.:11.3000 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.9000 Max. :9.000
## Vinho
## RED :1599
## WHITE:4898
##
##
##
##
Aqui podemos observar que residualsugar, chlorides, freesulfurdioxide e totalsulfurdioxide tem valores muito espaçados de minimos e máximos, podendo indicar alguns outliers e também há uma falta de equilíbrio entre a quantidade de vinhos RED e WHITE podendo interferir nos resultados de classificação.
BoxPlot das features
p1 <- plot_ly(y = fixedacidity, type="box", name = "Fixed Acidity")
p2 <- plot_ly(y = volatileacidity, type="box", name = "Volatile Acidity")
p3 <- plot_ly(y = citricacid, type="box", name = "Citric Acid")
p4 <- plot_ly(y = residualsugar, type="box", name = "Residual Sugar")
p5 <- plot_ly(y = chlorides, type="box", name = "Chlorides")
p6 <- plot_ly(y = freesulfurdioxide, type="box", name = "Free Sulfur Dioxide")
subplot(p1, p2, p3, p4, p5, p6, nrows=3)
p7 <- plot_ly(y = totalsulfurdioxide, type="box", name = "Total Sulfur Dioxide")
p8 <- plot_ly(y = density, type="box", name = "Density")
p9 <- plot_ly(y = pH, type="box", name = "PH")
p10 <- plot_ly(y = sulphates, type="box", name = "Sulphates")
p11 <- plot_ly(y = alcohol, type="box", name = "Alcohol")
p12 <- plot_ly(y = quality, type="box", name = "Quality")
subplot(p7, p8, p9, p10, p11, p12, nrows=3)
Historigramas das features
p1 <- plot_ly(x = fixedacidity, type="histogram", name = "Fixed Acidity")
p2 <- plot_ly(x = volatileacidity, type="histogram", name = "Volatile Acidity")
p3 <- plot_ly(x = citricacid, type="histogram", name = "Citric Acid")
p4 <- plot_ly(x = residualsugar, type="histogram", name = "Residual Sugar")
p5 <- plot_ly(x = chlorides, type="histogram", name = "Chlorides")
p6 <- plot_ly(x = freesulfurdioxide, type="histogram", name = "Free Sulfur Dioxide")
subplot(p1, p2, p3, p4, p5, p6, nrows=3)
p7 <- plot_ly(x = totalsulfurdioxide, type="histogram", name = "Total Sulfur Dioxide")
p8 <- plot_ly(x = density, type="histogram", name = "Density")
p9 <- plot_ly(x = pH, type="histogram", name = "PH")
p10 <- plot_ly(x = sulphates, type="histogram", name = "Sulphates")
p11 <- plot_ly(x = alcohol, type="histogram", name = "Alcohol")
p12 <- plot_ly(x = quality, type="histogram", name = "Quality")
subplot(p7, p8, p9, p10, p11, p12, nrows=3)
p1 <- plot_ly(x = Vinho, y = fixedacidity, color = Vinho, type="box", name = "Fixed Acidity")
p2 <- plot_ly(x = Vinho, y = volatileacidity, color = Vinho, type="box", name = "Volatile Acidity")
subplot(p1, p2, nrows=1)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
p3 <- plot_ly(x = Vinho, y = citricacid, color = Vinho, type="box", name = "Citric Acid")
p4 <- plot_ly(x = Vinho, y = residualsugar, color = Vinho, type="box", name = "Residual Sugar")
subplot(p3, p4, nrows=1)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
p5 <- plot_ly(x = Vinho, y = chlorides, color = Vinho, type="box", name = "Chlorides")
p6 <- plot_ly(x = Vinho, y = freesulfurdioxide, color = Vinho, type="box", name = "Free Sulfur Dioxide")
subplot(p5, p6, nrows=1)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
p7 <- plot_ly(x = Vinho, y = totalsulfurdioxide, color = Vinho, type="box", name = "Total Sulfur Dioxide")
p8 <- plot_ly(x = Vinho, y = density, type="box", color = Vinho, name = "Density")
subplot(p7, p8, nrows=1)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
p9 <- plot_ly(x = Vinho, y = pH, color = Vinho, type="box", name = "PH")
p10 <- plot_ly(x = Vinho, y = sulphates, color = Vinho, type="box", name = "Sulphates")
subplot(p9, p10, nrows=1)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
p11 <- plot_ly(x = Vinho, y = alcohol, color = Vinho, type="box", name = "Alcohol")
p12 <- plot_ly(x = Vinho, y = quality, color = Vinho, type="box", name = "Quality")
subplot(p11, p12, nrows=1)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Descobertas:
Separando os dados entre RED VS WHITE
Vinhos %>%
filter(Vinho == "RED") -> RED
Vinhos %>%
filter(Vinho == "WHITE") -> WHITE
Analisando o equilibrio dos dois grupos no dataset
plot_ly(x = Vinho, type="histogram", name = "RED VS WHITE")
Criando função para facilitar os plots
plot_hist <- function(data1, name1, data2, name2, feature, title) {
trace1 <- list(
x = data1[,feature],
marker = list(line = list(
color = "rgb(217, 217, 217)",
width = 0
)),
name = name1,
opacity = 0.75,
type = "histogram",
visible = TRUE
)
trace2 <- list(
x = data2[,feature],
marker = list(
color = "rgb(23, 190, 207)",
line = list(
color = "rgb(217, 217, 217)",
width = 0
)
),
name = name2,
opacity = 0.75,
type = "histogram",
visible = TRUE
)
data <- list(trace1, trace2)
layout <- list(
autosize = TRUE,
barmode = "overlay",
height = 521,
hovermode = "closest",
legend = list(
x = 1.0208,
y = 0.943734015345
),
margin = list(
r = 50,
t = 65,
b = 65,
l = 65
),
showlegend = TRUE,
title = "",
width = 788,
xaxis = list(
anchor = "y2",
autorange = TRUE,
range = c(-3.2795847824, 4.52178374944),
title = title,
type = "linear"
),
yaxis = list(
autorange = TRUE,
domain = c(0.2, 1),
range = c(0, 0.0968421052632),
title = "Values",
type = "linear"
)
)
p <- plot_ly() %>%
add_trace(x=trace1$x, histnorm=trace1$histnorm, marker=trace1$marker, name=trace1$name, opacity=trace1$opacity, type=trace1$type, uid=trace1$uid, visible=trace1$visible, xbins=trace1$xbins) %>%
add_trace(x=trace2$x, histnorm=trace2$histnorm, marker=trace2$marker, name=trace2$name, opacity=trace2$opacity, type=trace2$type, uid=trace2$uid, visible=trace2$visible, xbins=trace2$xbins) %>%
layout(autosize=layout$autosize, barmode=layout$barmode, hovermode=layout$hovermode, legend=layout$legend, margin=layout$margin, showlegend=layout$showlegend, title=layout$title, xaxis=layout$xaxis, yaxis=layout$yaxis)
p
}
plot_hist(RED, "RED", WHITE, "WHITE", "fixedacidity", "Fixed Acidity")
plot_hist(RED, "RED", WHITE, "WHITE", "volatileacidity", "Volatile Acidity")
plot_hist(RED, "RED", WHITE, "WHITE", "citricacid", "Citric Acid")
plot_hist(RED, "RED", WHITE, "WHITE", "residualsugar", "Residual Sugar")
plot_hist(RED, "RED", WHITE, "WHITE", "chlorides", "Chlorides")
plot_hist(RED, "RED", WHITE, "WHITE", "freesulfurdioxide", "Free Sulfur Dioxide")
plot_hist(RED, "RED", WHITE, "WHITE", "totalsulfurdioxide", "Total Sulfur Dioxide")
plot_hist(RED, "RED", WHITE, "WHITE", "density", "Density")
plot_hist(RED, "RED", WHITE, "WHITE", "pH", "PH")
plot_hist(RED, "RED", WHITE, "WHITE", "sulphates", "Sulphates")
plot_hist(RED, "RED", WHITE, "WHITE", "alcohol", "Alcohol")
Descobertas:
Correlação das features dos vinhos de todos os tipos em números
v <- Vinhos %>% select(c(quality,fixedacidity,volatileacidity,citricacid,residualsugar,
chlorides,freesulfurdioxide,totalsulfurdioxide,density,pH,
sulphates,alcohol))
vw <- subset(Vinhos, Vinho=="WHITE", select=c(quality,fixedacidity,volatileacidity,citricacid,residualsugar,
chlorides,freesulfurdioxide,totalsulfurdioxide,density,pH,
sulphates,alcohol))
vr <- subset(Vinhos, Vinho=="RED", select=c(quality,fixedacidity,volatileacidity,citricacid,residualsugar,
chlorides,freesulfurdioxide,totalsulfurdioxide,density,pH,
sulphates,alcohol))
matcorV <- cor(v)
matcorVW <- cor(vw)
matcorVR <- cor(vr)
print(matcorV, digits = 2)
## quality fixedacidity volatileacidity citricacid
## quality 1.000 -0.077 -0.266 0.086
## fixedacidity -0.077 1.000 0.219 0.324
## volatileacidity -0.266 0.219 1.000 -0.378
## citricacid 0.086 0.324 -0.378 1.000
## residualsugar -0.037 -0.113 -0.200 0.142
## chlorides -0.201 0.298 0.377 0.039
## freesulfurdioxide 0.055 -0.283 -0.353 0.133
## totalsulfurdioxide -0.041 -0.329 -0.414 0.195
## density -0.310 0.465 0.270 0.095
## pH 0.020 -0.253 0.261 -0.330
## sulphates 0.038 0.300 0.226 0.056
## alcohol 0.435 -0.102 -0.044 -0.008
## residualsugar chlorides freesulfurdioxide
## quality -0.037 -0.201 0.055
## fixedacidity -0.113 0.298 -0.283
## volatileacidity -0.200 0.377 -0.353
## citricacid 0.142 0.039 0.133
## residualsugar 1.000 -0.130 0.406
## chlorides -0.130 1.000 -0.195
## freesulfurdioxide 0.406 -0.195 1.000
## totalsulfurdioxide 0.498 -0.280 0.721
## density 0.543 0.367 0.028
## pH -0.270 0.045 -0.146
## sulphates -0.188 0.396 -0.188
## alcohol -0.353 -0.256 -0.173
## totalsulfurdioxide density pH sulphates alcohol
## quality -0.041 -0.310 0.020 0.038 0.435
## fixedacidity -0.329 0.465 -0.253 0.300 -0.102
## volatileacidity -0.414 0.270 0.261 0.226 -0.044
## citricacid 0.195 0.095 -0.330 0.056 -0.008
## residualsugar 0.498 0.543 -0.270 -0.188 -0.353
## chlorides -0.280 0.367 0.045 0.396 -0.256
## freesulfurdioxide 0.721 0.028 -0.146 -0.188 -0.173
## totalsulfurdioxide 1.000 0.032 -0.238 -0.276 -0.256
## density 0.032 1.000 0.010 0.262 -0.688
## pH -0.238 0.010 1.000 0.192 0.121
## sulphates -0.276 0.262 0.192 1.000 -0.006
## alcohol -0.256 -0.688 0.121 -0.006 1.000
Correlação das features dos vinhos de todos os tipo visualização
corrgram(matcorV, type = "cor", lower.panel = panel.shade, upper.panel = panel.pie)
Correlação das features dos vinhos Brancos
corrgram(matcorVW, type = "cor", lower.panel = panel.shade, upper.panel = panel.pie)
Correlação das features dos vinhos Vermelhos
corrgram(matcorVR, type = "cor", lower.panel = panel.shade, upper.panel = panel.pie)
Descobertas:
Alcohol tem uma correlação positiva alta com Quality.Dióxido de Enxofre Livre e Dióxido de Enxofre Total tem uma grande correlação positiva e imagino que isso seja intuitivo, portanto podemos escolher apenas uma delas para usar no nosso modelo.Fixed Acidity, Density, Residual Sugar tem grandes correlações positivas e negativas entre sí e podem ser analisadas para descarte caso necessário.WHITE e RED apresentam algumas divergências.WHITE apresentam correlações consistentes com a matriz do dataset total, e isso pode ser explicado pelo fato de haver mais observações desse tipo de vinho, enquanto as do tipo RED apresentam variáveis como Volatile Acidity com correlação negativa a Quality e Citric Acid, Sulphates com correlação positiva maior com Quality. Fixed Acidity, Citric Acidity, Fixed Acidity, Density, PH, Volatile Acidity apresentam grandes correlações entre si tanto positivas como negativas, e podem ser analisadas para utilização de apenas uma dentre os pares.v <- Vinhos %>% select(c(fixedacidity,volatileacidity,citricacid,residualsugar,
chlorides,freesulfurdioxide,totalsulfurdioxide,density,pH,sulphates,alcohol))
prcomp(v, scale = T)
## Standard deviations (1, .., p=11):
## [1] 1.7411519 1.5791281 1.2464142 0.9854455 0.8453663 0.7789416 0.7235571
## [8] 0.7094322 0.5820891 0.4780996 0.1856424
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4
## fixedacidity 0.24236004 -0.33547750 0.43240969 -0.16394790
## volatileacidity 0.38239507 -0.10993352 -0.30734109 -0.21119774
## citricacid -0.15062616 -0.18663415 0.59062252 0.26551064
## residualsugar -0.34375872 -0.33196390 -0.16573486 -0.16086481
## chlorides 0.29424864 -0.31037330 -0.01768063 0.24197100
## freesulfurdioxide -0.42974116 -0.07978302 -0.13732085 0.35624480
## totalsulfurdioxide -0.48582963 -0.09495997 -0.10989691 0.20783078
## density 0.05440968 -0.58297499 -0.17655419 -0.06719265
## pH 0.21741517 0.16191843 -0.45304319 0.41722469
## sulphates 0.29628830 -0.18696101 0.07006084 0.64133655
## alcohol 0.09418411 0.46646648 0.26397977 0.11655642
## PC5 PC6 PC7 PC8
## fixedacidity 0.1547602 0.20364057 0.27971559 -0.3963022602
## volatileacidity -0.1560440 0.49104757 0.39095268 0.0720670986
## citricacid 0.1497985 -0.22550146 0.38244529 0.2879511173
## residualsugar 0.3459300 0.24730219 -0.21431881 0.5319252460
## chlorides -0.6196015 -0.16073700 0.05007998 0.4690672718
## freesulfurdioxide -0.2275352 0.32825130 0.29793827 -0.2221646385
## totalsulfurdioxide -0.1602445 0.14420465 0.13874114 -0.1357970250
## density 0.3062889 -0.01424526 0.04784695 0.0009818868
## pH 0.4558231 -0.29275459 0.41678462 0.0378994953
## sulphates 0.1291003 0.29372378 -0.52732734 -0.1688302664
## alcohol 0.1731511 0.52804489 0.11009560 0.3925907281
## PC9 PC10 PC11
## fixedacidity -0.34687946 0.277654103 0.341651503
## volatileacidity 0.50031121 -0.149836074 0.079676500
## citricacid 0.40781678 -0.233092669 -0.005579421
## residualsugar -0.09717103 -0.007398810 0.446188821
## chlorides -0.29455744 0.191584862 0.049801574
## freesulfurdioxide -0.36234369 -0.483988851 0.002484427
## totalsulfurdioxide 0.31411739 0.716601945 -0.057705317
## density -0.11811012 0.008526929 -0.714957824
## pH -0.12636044 0.137903305 0.208754539
## sulphates 0.20903186 -0.044780650 0.076287477
## alcohol -0.25153240 0.201262921 -0.333812989
plot(prcomp(v, scale = T))
summary(prcomp(v, scale = T))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.7412 1.5791 1.2464 0.98545 0.84537 0.77894
## Proportion of Variance 0.2756 0.2267 0.1412 0.08828 0.06497 0.05516
## Cumulative Proportion 0.2756 0.5023 0.6435 0.73181 0.79678 0.85194
## PC7 PC8 PC9 PC10 PC11
## Standard deviation 0.72356 0.70943 0.5821 0.47810 0.18564
## Proportion of Variance 0.04759 0.04575 0.0308 0.02078 0.00313
## Cumulative Proportion 0.89953 0.94528 0.9761 0.99687 1.00000
biplot(prcomp(v, scale = TRUE))
Descobertas:
Rotation que temos features com correlações muito fortes como: freesulfurdioxide e totalsulfurdioxide, também fixedacidity e volatileacidity, que já havíamos identificado em outras partes da análise e podem ser candidatas para utilizar apenas uma delas.v <- Vinhos %>% select(c(fixedacidity,volatileacidity,citricacid,residualsugar,
chlorides,freesulfurdioxide,totalsulfurdioxide,density,pH,sulphates,alcohol))
v <- rm.outlier(v, fill = T, median = T, opposite = FALSE)
vw <- rm.outlier(vw, fill = T, median = T, opposite = FALSE)
Descobertas:
Criaremos modelos separados para a tarefa de regressão e classificação da feature Quality, porque existem um número relativamente maior de observações de WHITE em relação a RED. Isso indica que podemos acabar gerando um modelo mais acertivo para vinhos brancos e nem tanto para vinhos vermelhos, esse desbalanço do dataset pode fazer com que as características dos dois tipos não apareçam no modelo. Porém gostaríamos de comprovar a ideia na parte do treinamento do modelo.
Substituir outliers com a mediana, mantendo a informação porém evitando prejudicar os modelos de regressão.
Não, os componentes gerados apresentaram uma variação proporcional muito baixa, e não vemos vantagens em utilizá-los.
A seguir definiremos 2/3 da base de vinhos brancos para treino, e 1/3 para teste
prt <- 2/3
set.seed(666)
treino <- sample(1:nrow(vw), as.integer(prt*nrow(vw)))
dataTreino <- vw[treino,]
dataTeste <- vw[-treino,]
Validando a consistência de qualidade entre as bases de treino e teste
prop.table(table(dataTreino$quality))
##
## 3 4 5 6 7 8
## 0.003675345 0.033384380 0.297396631 0.446554364 0.182235835 0.036753446
prop.table(table(dataTeste$quality))
##
## 3 4 5 6 7 8
## 0.004898959 0.033067973 0.297611758 0.456215554 0.174525413 0.033680343
Descobertas: - As proporções estão razoavelmente bem distribuídas entre as notas de qualidade. Isso leva a ter um bom treino para o modelo
Realizando a regressão linear com todas as variáveis
x <- lm(quality~fixedacidity+volatileacidity+citricacid+residualsugar+chlorides+freesulfurdioxide+totalsulfurdioxide+density+pH+sulphates+alcohol, data=dataTreino)
Analisando o summary
summary(x)
##
## Call:
## lm(formula = quality ~ fixedacidity + volatileacidity + citricacid +
## residualsugar + chlorides + freesulfurdioxide + totalsulfurdioxide +
## density + pH + sulphates + alcohol, data = dataTreino)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2254 -0.5019 -0.0524 0.4610 2.8019
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.163e+02 2.696e+01 8.023 1.43e-15 ***
## fixedacidity 1.332e-01 2.780e-02 4.790 1.75e-06 ***
## volatileacidity -1.819e+00 1.341e-01 -13.565 < 2e-16 ***
## citricacid 1.051e-01 1.185e-01 0.887 0.37495
## residualsugar 1.007e-01 1.023e-02 9.841 < 2e-16 ***
## chlorides 3.834e-01 6.868e-01 0.558 0.57666
## freesulfurdioxide 4.908e-03 1.059e-03 4.633 3.75e-06 ***
## totalsulfurdioxide -2.991e-04 4.658e-04 -0.642 0.52086
## density -2.177e+02 2.733e+01 -7.967 2.23e-15 ***
## pH 9.934e-01 1.354e-01 7.335 2.78e-13 ***
## sulphates 6.517e-01 1.208e-01 5.396 7.30e-08 ***
## alcohol 1.213e-01 3.418e-02 3.551 0.00039 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7379 on 3253 degrees of freedom
## Multiple R-squared: 0.3037, Adjusted R-squared: 0.3013
## F-statistic: 129 on 11 and 3253 DF, p-value: < 2.2e-16
Descobertas:
Eliminando variáveis não significativas
stepwise <- step(x,direction="both")
## Start: AIC=-1973.23
## quality ~ fixedacidity + volatileacidity + citricacid + residualsugar +
## chlorides + freesulfurdioxide + totalsulfurdioxide + density +
## pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## - chlorides 1 0.170 1771.2 -1974.9
## - totalsulfurdioxide 1 0.224 1771.2 -1974.8
## - citricacid 1 0.429 1771.4 -1974.4
## <none> 1771.0 -1973.2
## - alcohol 1 6.864 1777.9 -1962.6
## - freesulfurdioxide 1 11.685 1782.7 -1953.8
## - fixedacidity 1 12.489 1783.5 -1952.3
## - sulphates 1 15.853 1786.9 -1946.1
## - pH 1 29.295 1800.3 -1921.7
## - density 1 34.555 1805.6 -1912.1
## - residualsugar 1 52.726 1823.7 -1879.4
## - volatileacidity 1 100.183 1871.2 -1795.6
##
## Step: AIC=-1974.91
## quality ~ fixedacidity + volatileacidity + citricacid + residualsugar +
## freesulfurdioxide + totalsulfurdioxide + density + pH + sulphates +
## alcohol
##
## Df Sum of Sq RSS AIC
## - totalsulfurdioxide 1 0.225 1771.4 -1976.5
## - citricacid 1 0.530 1771.7 -1975.9
## <none> 1771.2 -1974.9
## + chlorides 1 0.170 1771.0 -1973.2
## - alcohol 1 6.877 1778.1 -1964.3
## - freesulfurdioxide 1 11.747 1782.9 -1955.3
## - fixedacidity 1 12.377 1783.6 -1954.2
## - sulphates 1 15.735 1786.9 -1948.0
## - pH 1 29.387 1800.6 -1923.2
## - density 1 34.677 1805.9 -1913.6
## - residualsugar 1 53.623 1824.8 -1879.5
## - volatileacidity 1 100.344 1871.5 -1797.0
##
## Step: AIC=-1976.5
## quality ~ fixedacidity + volatileacidity + citricacid + residualsugar +
## freesulfurdioxide + density + pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## - citricacid 1 0.507 1771.9 -1977.6
## <none> 1771.4 -1976.5
## + totalsulfurdioxide 1 0.225 1771.2 -1974.9
## + chlorides 1 0.170 1771.2 -1974.8
## - alcohol 1 6.684 1778.1 -1966.2
## - fixedacidity 1 12.795 1784.2 -1955.0
## - freesulfurdioxide 1 15.593 1787.0 -1949.9
## - sulphates 1 15.599 1787.0 -1949.9
## - pH 1 29.875 1801.3 -1923.9
## - density 1 38.465 1809.9 -1908.4
## - residualsugar 1 57.075 1828.5 -1875.0
## - volatileacidity 1 107.007 1878.4 -1787.0
##
## Step: AIC=-1977.56
## quality ~ fixedacidity + volatileacidity + residualsugar + freesulfurdioxide +
## density + pH + sulphates + alcohol
##
## Df Sum of Sq RSS AIC
## <none> 1771.9 -1977.6
## + citricacid 1 0.507 1771.4 -1976.5
## + chlorides 1 0.269 1771.7 -1976.1
## + totalsulfurdioxide 1 0.202 1771.7 -1975.9
## - alcohol 1 7.039 1779.0 -1966.6
## - fixedacidity 1 13.390 1785.3 -1955.0
## - sulphates 1 15.937 1787.8 -1950.3
## - freesulfurdioxide 1 16.212 1788.1 -1949.8
## - pH 1 29.391 1801.3 -1925.8
## - density 1 37.987 1809.9 -1910.3
## - residualsugar 1 56.589 1828.5 -1876.9
## - volatileacidity 1 112.459 1884.4 -1778.7
y <- lm(quality ~ fixedacidity + volatileacidity + residualsugar + freesulfurdioxide + density + pH + sulphates + alcohol, data=dataTreino)
summary(y)
##
## Call:
## lm(formula = quality ~ fixedacidity + volatileacidity + residualsugar +
## freesulfurdioxide + density + pH + sulphates + alcohol, data = dataTreino)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2194 -0.4993 -0.0511 0.4599 2.7910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.161e+02 2.569e+01 8.411 < 2e-16 ***
## fixedacidity 1.344e-01 2.709e-02 4.960 7.40e-07 ***
## volatileacidity -1.851e+00 1.287e-01 -14.375 < 2e-16 ***
## residualsugar 1.001e-01 9.814e-03 10.197 < 2e-16 ***
## freesulfurdioxide 4.579e-03 8.388e-04 5.458 5.18e-08 ***
## density -2.174e+02 2.602e+01 -8.355 < 2e-16 ***
## pH 9.732e-01 1.324e-01 7.349 2.51e-13 ***
## sulphates 6.512e-01 1.203e-01 5.412 6.70e-08 ***
## alcohol 1.217e-01 3.383e-02 3.596 0.000327 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7377 on 3256 degrees of freedom
## Multiple R-squared: 0.3033, Adjusted R-squared: 0.3016
## F-statistic: 177.2 on 8 and 3256 DF, p-value: < 2.2e-16
pred1 <- predict(y, newdata = dataTeste)
RMSE(pred = pred1, obs = dataTeste$quality)
## [1] 0.7623217
Descobertas:
p-value é menor que 5% então podemos rejeitar a hipótese nulaR-squared é de 30, o que significa que a regressão linear não descreve o modelo com tanta precisãoConclusão:
Basicamente a árvore de regressão particiona o dataset em subgrupos menores e então estabelece um constante simples para cada observação daquele subgrupo. O particionamento é alcançado através de sucessivas partições binárias, também conhecido como particionamento binário recursivo, baseado nas diferentes características. A constante a ser prevista é baseado na média da resposta dos valores de todas as observações que caem naquele subgrupo.
As árvores de decisão tendem ao overfitting então é necessário fazer um fine tunning com alguns hyperparametros para ajusta-las, como por exemplo o numero máximo de galhos em um nó e o numero mínimo de observações que uma folha pode conter.
A variável target é a quality e as variáveis que utilizaremos para prever o seu valor são: fixedacidity, volatileacidity, citricacid, residualsugar, chlorides, freesulfurdioxide, totalsulfurdioxide, density, pH,sulphates,alcohol.
v <- Vinhos %>% select(c(fixedacidity,volatileacidity,citricacid,residualsugar,
chlorides,freesulfurdioxide,totalsulfurdioxide,density,pH,sulphates,alcohol))
set.seed(123)
sample <- sample.split(Vinhos$quality, SplitRatio = .70)
v_train <- subset(Vinhos, sample == TRUE)
v_test <- subset(Vinhos, sample == FALSE)
# Árvore de regressão sem fine tuning
m1 <- rpart(
formula = quality ~ .,
data = v_train,
method = "anova"
)
pred1 <- predict(m1, newdata = v_test)
RMSE(pred = pred1, obs = v_test$quality)
## [1] 0.7624472
rpart.plot(m1)
plotcp(m1)
Bagging
set.seed(123)
# Árvore de regressão usando bagging
m2 <- bagging(
formula = quality ~ .,
data = v_train,
coob = TRUE
)
# get OOB error
m2$err
## [1] 0.7441564
# predicion error
pred2 <- predict(m2, newdata = v_test)
RMSE(pred = pred2, obs = v_test$quality)
## [1] 0.7389704
Árvore de regressão usando bagging e 10-fold cross validation
set.seed(123)
# Specify 10-fold cross validation
ctrl <- trainControl(method = "cv", number = 10)
# CV bagged model
m3 <- train(
quality ~ .,
data = v_train,
method = "treebag",
trControl = ctrl,
importance = TRUE
)
# assess results
m3
## Bagged CART
##
## 4548 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4093, 4094, 4093, 4093, 4093, 4093, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.7426551 0.2798523 0.5878768
pred3 <- predict(m3, newdata = v_test)
RMSE(pred = pred3, obs = v_test$quality)
## [1] 0.7412857
plot most important variables
plot(varImp(m3), 20)
v <- Vinhos %>% select(c(fixedacidity,volatileacidity,citricacid,residualsugar,
chlorides,freesulfurdioxide,totalsulfurdioxide,density,pH,sulphates,alcohol))
v <- rm.outlier(v, fill = T, median = T, opposite = FALSE)
v %>% mutate(quality = quality) -> v
head(v)
## fixedacidity volatileacidity citricacid residualsugar chlorides
## 1 6.6 0.24 0.35 7.70 0.031
## 2 6.7 0.34 0.43 1.60 0.041
## 3 10.6 0.31 0.49 2.20 0.063
## 4 5.4 0.18 0.24 4.80 0.041
## 5 6.7 0.30 0.44 18.75 0.057
## 6 6.8 0.50 0.11 1.50 0.075
## freesulfurdioxide totalsulfurdioxide density pH sulphates alcohol
## 1 36 135 0.99380 3.19 0.37 10.5
## 2 29 114 0.99014 3.23 0.44 12.6
## 3 18 40 0.99760 3.14 0.51 9.8
## 4 30 113 0.99445 3.42 0.40 9.4
## 5 65 224 0.99956 3.11 0.53 9.1
## 6 16 49 0.99545 3.36 0.79 9.5
## quality
## 1 5
## 2 6
## 3 6
## 4 6
## 5 5
## 6 5
set.seed(123)
sample <- sample.split(v$quality, SplitRatio = .70)
v_train <- subset(v, sample == TRUE)
v_test <- subset(v, sample == FALSE)
# Árvore de regressão sem fine tuning
m1 <- rpart(
formula = quality ~ .,
data = v_train,
method = "anova"
)
pred1 <- predict(m1, newdata = v_test)
RMSE(pred = pred1, obs = v_test$quality)
## [1] 0.7607358
rpart.plot(m1)
plotcp(m1)
Bagging
set.seed(123)
# Árvore de regressão usando bagging
m2 <- bagging(
formula = quality ~ .,
data = v_train,
coob = TRUE
)
# get OOB error
m2$err
## [1] 0.7444525
# predicion error
pred2 <- predict(m2, newdata = v_test)
RMSE(pred = pred2, obs = v_test$quality)
## [1] 0.73926
Árvore de regressão usando bagging e 10-fold cross validation
set.seed(123)
# Specify 10-fold cross validation
ctrl <- trainControl(method = "cv", number = 10)
# CV bagged model
m3 <- train(
quality ~ .,
data = v_train,
method = "treebag",
trControl = ctrl,
importance = TRUE
)
# assess results
m3
## Bagged CART
##
## 4548 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 4093, 4094, 4093, 4093, 4093, 4093, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.7430463 0.2790741 0.588015
pred3 <- predict(m3, newdata = v_test)
RMSE(pred = pred3, obs = v_test$quality)
## [1] 0.7415393
plot(varImp(m3), 20)
v <- Vinhos %>%
filter(Vinho == 'WHITE') %>%
select(c(fixedacidity,volatileacidity,citricacid,residualsugar,
chlorides,freesulfurdioxide,totalsulfurdioxide,density,pH,sulphates,alcohol))
v <- rm.outlier(v, fill = T, median = T, opposite = FALSE)
whiteQuality <- Vinhos %>%
filter(Vinho == 'WHITE') %>%
select(c(quality))
v %>% mutate(quality = whiteQuality$quality ) -> v
head(v)
## fixedacidity volatileacidity citricacid residualsugar chlorides
## 1 6.6 0.24 0.35 7.70 0.031
## 2 6.7 0.34 0.43 1.60 0.041
## 3 5.4 0.18 0.24 4.80 0.041
## 4 6.7 0.30 0.44 18.75 0.057
## 5 5.1 0.26 0.33 1.10 0.027
## 6 6.2 0.22 0.20 20.80 0.035
## freesulfurdioxide totalsulfurdioxide density pH sulphates alcohol
## 1 36 135 0.99380 3.19 0.37 10.5
## 2 29 114 0.99014 3.23 0.44 12.6
## 3 30 113 0.99445 3.42 0.40 9.4
## 4 65 224 0.99956 3.11 0.53 9.1
## 5 46 113 0.98946 3.35 0.43 11.4
## 6 58 184 1.00022 3.11 0.53 9.0
## quality
## 1 5
## 2 6
## 3 6
## 4 5
## 5 7
## 6 6
set.seed(123)
sample <- sample.split(v$quality, SplitRatio = .70)
v_train <- subset(v, sample == TRUE)
v_test <- subset(v, sample == FALSE)
# Árvore de regressão sem fine tuning
m1 <- rpart(
formula = quality ~ .,
data = v_train,
method = "anova"
)
pred1 <- predict(m1, newdata = v_test)
RMSE(pred = pred1, obs = v_test$quality)
## [1] 0.7724475
rpart.plot(m1)
plotcp(m1)
Bagging
set.seed(123)
# Árvore de regressão usando bagging
m2 <- bagging(
formula = quality ~ .,
data = v_train,
coob = TRUE
)
# get OOB error
m2$err
## [1] 0.7418527
# predicion error
pred2 <- predict(m2, newdata = v_test)
RMSE(pred = pred2, obs = v_test$quality)
## [1] 0.7523468
Árvore de regressão usando bagging e 10-fold cross validation
set.seed(123)
# Specify 10-fold cross validation
ctrl <- trainControl(method = "cv", number = 10)
# CV bagged model
m3 <- train(
quality ~ .,
data = v_train,
method = "treebag",
trControl = ctrl,
importance = TRUE
)
# assess results
m3
## Bagged CART
##
## 3429 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 3085, 3086, 3087, 3086, 3086, 3087, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.7392487 0.3046547 0.5846012
pred3 <- predict(m3, newdata = v_test)
RMSE(pred = pred3, obs = v_test$quality)
## [1] 0.7534588
plot(varImp(m3), 20)
v <- Vinhos %>%
filter(Vinho == 'RED') %>%
select(c(fixedacidity,volatileacidity,citricacid,residualsugar,
chlorides,freesulfurdioxide,totalsulfurdioxide,density,pH,sulphates,alcohol))
v <- rm.outlier(v, fill = T, median = T, opposite = FALSE)
redQuality <- Vinhos %>%
filter(Vinho == 'RED') %>%
select(c(quality))
v %>% mutate(quality = redQuality$quality ) -> v
head(v)
## fixedacidity volatileacidity citricacid residualsugar chlorides
## 1 10.6 0.310 0.49 2.2 0.063
## 2 6.8 0.500 0.11 1.5 0.075
## 3 6.6 0.610 0.00 1.6 0.069
## 4 7.2 0.660 0.33 2.5 0.068
## 5 7.2 0.630 0.00 1.9 0.097
## 6 7.1 0.735 0.16 1.9 0.100
## freesulfurdioxide totalsulfurdioxide density pH sulphates alcohol
## 1 18 40 0.99760 3.14 0.51 9.8
## 2 16 49 0.99545 3.36 0.79 9.5
## 3 4 8 0.99396 3.33 0.37 10.4
## 4 34 102 0.99414 3.27 0.78 12.8
## 5 14 38 0.99675 3.37 0.58 9.0
## 6 15 77 0.99660 3.27 0.64 9.3
## quality
## 1 6
## 2 5
## 3 4
## 4 6
## 5 6
## 6 5
set.seed(123)
sample <- sample.split(v$quality, SplitRatio = .70)
v_train <- subset(v, sample == TRUE)
v_test <- subset(v, sample == FALSE)
# Árvore de regressão sem fine tuning
m1 <- rpart(
formula = quality ~ .,
data = v_train,
method = "anova"
)
pred1 <- predict(m1, newdata = v_test)
RMSE(pred = pred1, obs = v_test$quality)
## [1] 0.6611792
rpart.plot(m1)
plotcp(m1)
Bagging
set.seed(123)
# Árvore de regressão usando bagging
m2 <- bagging(
formula = quality ~ .,
data = v_train,
coob = TRUE
)
# get OOB error
m2$err
## [1] 0.6368096
# predicion error
pred2 <- predict(m2, newdata = v_test)
RMSE(pred = pred2, obs = v_test$quality)
## [1] 0.6350595
Árvore de regressão usando bagging e 10-fold cross validation
set.seed(123)
# Specify 10-fold cross validation
ctrl <- trainControl(method = "cv", number = 10)
# CV bagged model
m3 <- train(
quality ~ .,
data = v_train,
method = "treebag",
trControl = ctrl,
importance = TRUE
)
# assess results
m3
## Bagged CART
##
## 1120 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1008, 1009, 1009, 1007, 1007, 1008, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.626993 0.4015921 0.4892482
pred3 <- predict(m3, newdata = v_test)
RMSE(pred = pred3, obs = v_test$quality)
## [1] 0.6412619
plot(varImp(m3), 20)
x <- data.frame(
"sem fine tuning" = c(0.7624472, 0.7607358, 0.7724475, 0.6611792),
"bagging" = c(0.7389704,0.73926, 0.7523468, 0.6350595),
"bagging 10k fold cross validation" = c(0.7412857, 0.7415393, 0.7534588, 0.6412619)
)
rownames(x) <- c("Sem Modificação", "Sem outliers", "Apenas com WHITE", "Apenas com RED")
x
## sem.fine.tuning bagging
## Sem Modificação 0.7624472 0.7389704
## Sem outliers 0.7607358 0.7392600
## Apenas com WHITE 0.7724475 0.7523468
## Apenas com RED 0.6611792 0.6350595
## bagging.10k.fold.cross.validation
## Sem Modificação 0.7412857
## Sem outliers 0.7415393
## Apenas com WHITE 0.7534588
## Apenas com RED 0.6412619
Descobertas:
RED parece trazer resultados melhores, conseguimos perceber que existe diferentes notáveis entre os dois tipos e que as variáveis relevantes para explicar RED são diferentes para WHITE.WHITE acabou gerando um resultado pior do que treinar um modelo para o dataset inteiro.Identificamos que as técnicas de árvore de regressão se sairam melhor descrevendo esse dataset.
v <- Vinhos %>%
mutate(aboveAverage = quality >= 7.0 )
v$aboveAverage <- as.factor(v$aboveAverage)
v <- v %>% select(c(fixedacidity,volatileacidity,citricacid,residualsugar,
chlorides,freesulfurdioxide,totalsulfurdioxide,density,pH,sulphates,alcohol, aboveAverage))
set.seed(123)
sample <- sample.split(v$aboveAverage, SplitRatio = .70)
v_train <- subset(v, sample == TRUE)
v_test <- subset(v, sample == FALSE)
# Árvore de regressão sem fine tuning
model.tree <- rpart(
formula = aboveAverage ~ .,
data = v_train
)
pred.tree <- predict(model.tree, v_test, type = 'class')
xlab <- table(actualclass=v_test$aboveAverage,predictedclass=pred.tree)
confusionMatrix(xlab)
## Confusion Matrix and Statistics
##
## predictedclass
## actualclass FALSE TRUE
## FALSE 1468 98
## TRUE 261 122
##
## Accuracy : 0.8158
## 95% CI : (0.7979, 0.8328)
## No Information Rate : 0.8871
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.305
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8490
## Specificity : 0.5545
## Pos Pred Value : 0.9374
## Neg Pred Value : 0.3185
## Prevalence : 0.8871
## Detection Rate : 0.7532
## Detection Prevalence : 0.8035
## Balanced Accuracy : 0.7018
##
## 'Positive' Class : FALSE
##
#Obtendo probabilidades da base de test
probs.tree <- predict(model.tree, newdata=v_test, type="prob")
#Calculando curva ROC
rocCurve.tree <- roc(v_test$aboveAverage, probs.tree[,"TRUE"])
#plot curva ROC
plot(rocCurve.tree, col=c(3))
#calculando a area abaixo da curva (quanto maior melhor)
auc(rocCurve.tree)
## Area under the curve: 0.7401
Bagging
set.seed(123)
# Árvore de regressão usando bagging
model.bagging <- bagging(
formula = aboveAverage ~ .,
data = v_train,
coob = TRUE
)
pred.bagging <- predict(model.bagging, v_test, type = 'class')
xlab <- table(actualclass=v_test$aboveAverage,predictedclass=pred.bagging)
confusionMatrix(xlab)
## Confusion Matrix and Statistics
##
## predictedclass
## actualclass FALSE TRUE
## FALSE 1476 90
## TRUE 150 233
##
## Accuracy : 0.8769
## 95% CI : (0.8614, 0.8911)
## No Information Rate : 0.8343
## P-Value [Acc > NIR] : 9.276e-08
##
## Kappa : 0.5855
## Mcnemar's Test P-Value : 0.0001398
##
## Sensitivity : 0.9077
## Specificity : 0.7214
## Pos Pred Value : 0.9425
## Neg Pred Value : 0.6084
## Prevalence : 0.8343
## Detection Rate : 0.7573
## Detection Prevalence : 0.8035
## Balanced Accuracy : 0.8146
##
## 'Positive' Class : FALSE
##
#Obtendo probabilidades da base de test
probs.bagging <- predict(model.bagging, newdata=v_test, type="prob")
#Calculando curva ROC
rocCurve.bagging <- roc(v_test$aboveAverage, probs.bagging[,"TRUE"])
#plot curva ROC
plot(rocCurve.bagging, col=c(3))
#calculando a area abaixo da curva (quanto maior melhor)
auc(rocCurve.bagging)
## Area under the curve: 0.8995
Bagging e 10-fold cross validation
set.seed(123)
# Specify 10-fold cross validation
ctrl <- trainControl(method = "repeatedcv", number = 10, allowParallel=TRUE)
# CV bagged model
model.treebag <- train(
as.factor(aboveAverage) ~ .,
data = v_train,
method = "treebag",
trControl = ctrl,
importance=TRUE
)
# assess results
pred.treebag <- predict(model.treebag, v_test, type = 'raw')
xlab <- table(actualclass=v_test$aboveAverage,predictedclass=pred.treebag)
confusionMatrix(xlab)
## Confusion Matrix and Statistics
##
## predictedclass
## actualclass FALSE TRUE
## FALSE 1481 85
## TRUE 158 225
##
## Accuracy : 0.8753
## 95% CI : (0.8598, 0.8897)
## No Information Rate : 0.8409
## P-Value [Acc > NIR] : 1.099e-05
##
## Kappa : 0.5746
## Mcnemar's Test P-Value : 3.860e-06
##
## Sensitivity : 0.9036
## Specificity : 0.7258
## Pos Pred Value : 0.9457
## Neg Pred Value : 0.5875
## Prevalence : 0.8409
## Detection Rate : 0.7599
## Detection Prevalence : 0.8035
## Balanced Accuracy : 0.8147
##
## 'Positive' Class : FALSE
##
#Obtendo probabilidades da base de test
probs.treebag <- predict(model.treebag, newdata=v_test, type="prob")
#Calculando curva ROC
rocCurve.treebag <- roc(v_test$aboveAverage, probs.treebag[,"TRUE"])
#plot curva ROC
plot(rocCurve.treebag, col=c(3))
#calculando a area abaixo da curva (quanto maior melhor)
auc(rocCurve.treebag)
## Area under the curve: 0.8932
Random Forest
set.seed(123)
# Specify 10-fold cross validation
ctrl <- trainControl(method = "repeatedcv", number = 10, allowParallel=TRUE)
# CV bagged model
model.rf <- train(
as.factor(aboveAverage) ~ .,
data = v_train,
method = "rf",
trControl = ctrl,
importance=TRUE
)
# assess results
pred.rf <- predict(model.rf, v_test, type = 'raw')
xlab <- table(actualclass=v_test$aboveAverage,predictedclass=pred.rf)
confusionMatrix(xlab)
## Confusion Matrix and Statistics
##
## predictedclass
## actualclass FALSE TRUE
## FALSE 1496 70
## TRUE 151 232
##
## Accuracy : 0.8866
## 95% CI : (0.8717, 0.9004)
## No Information Rate : 0.845
## P-Value [Acc > NIR] : 8.024e-08
##
## Kappa : 0.6098
## Mcnemar's Test P-Value : 7.392e-08
##
## Sensitivity : 0.9083
## Specificity : 0.7682
## Pos Pred Value : 0.9553
## Neg Pred Value : 0.6057
## Prevalence : 0.8450
## Detection Rate : 0.7676
## Detection Prevalence : 0.8035
## Balanced Accuracy : 0.8383
##
## 'Positive' Class : FALSE
##
#Obtendo probabilidades da base de test
probs.rf <- predict(model.rf, newdata=v_test, type="prob")
#Calculando curva ROC
rocCurve.rf <- roc(v_test$aboveAverage, probs.rf[,"TRUE"])
#plot curva ROC
plot(rocCurve.rf, col=c(3))
#calculando a area abaixo da curva (quanto maior melhor)
auc(rocCurve.rf)
## Area under the curve: 0.9067
Random Forest with boosting
set.seed(123)
# Specify 10-fold cross validation
ctrl <- trainControl(method = "repeatedcv", number = 10, allowParallel=TRUE)
# CV bagged model
model.gbm <- train(
as.factor(aboveAverage) ~ .,
data = v_train,
verbose=F,
method = "gbm",
trControl = ctrl
)
# assess results
pred.gbm <- predict(model.gbm, v_test, type = 'raw')
xlab <- table(actualclass=v_test$aboveAverage,predictedclass=pred.gbm)
confusionMatrix(xlab)
## Confusion Matrix and Statistics
##
## predictedclass
## actualclass FALSE TRUE
## FALSE 1475 91
## TRUE 225 158
##
## Accuracy : 0.8379
## 95% CI : (0.8207, 0.854)
## No Information Rate : 0.8722
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4084
## Mcnemar's Test P-Value : 7.329e-14
##
## Sensitivity : 0.8676
## Specificity : 0.6345
## Pos Pred Value : 0.9419
## Neg Pred Value : 0.4125
## Prevalence : 0.8722
## Detection Rate : 0.7568
## Detection Prevalence : 0.8035
## Balanced Accuracy : 0.7511
##
## 'Positive' Class : FALSE
##
#Obtendo probabilidades da base de test
probs.gbm <- predict(model.gbm, newdata=v_test, type="prob")
#Calculando curva ROC
rocCurve.gbm <- roc(v_test$aboveAverage, probs.gbm[,"TRUE"])
#plot curva ROC
plot(rocCurve.gbm, col=c(3))
#calculando a area abaixo da curva (quanto maior melhor)
auc(rocCurve.gbm)
## Area under the curve: 0.8617
Analisando todas as curvas ROC dos modelos gerados
plot(rocCurve.tree,col=c(4)) # color blue is simple tree
plot(rocCurve.bagging ,add=TRUE,col=c(6)) # color magenta is bagging
plot(rocCurve.treebag ,add=TRUE,col=c(2)) # color red is treebag
plot(rocCurve.rf,add=TRUE,col=c(1)) # color black is rf
plot(rocCurve.gbm,add=TRUE,col=c(3)) # color green is gbm
Descobertas:
0.8866, área sob a curva ROC 0.9067Para gerar um modelo de regressão logística, primeiro devemos categorizar a variável target “quality” Definimos uma nova variável categórica (qualidade) com notas iguais ou acima de 7 sendo um vinho bom (1), e abaixo sendo ruim (0)
notaCorte = 7
rldataTreino <- dataTreino
rldataTeste <- dataTeste
attach(rldataTreino)
## The following objects are masked from Vinhos:
##
## alcohol, chlorides, citricacid, density, fixedacidity,
## freesulfurdioxide, pH, quality, residualsugar, sulphates,
## totalsulfurdioxide, volatileacidity
rldataTreino$qualidade <- ifelse(rldataTreino$quality >= notaCorte, 1,ifelse(rldataTreino$quality < notaCorte, 0,0))
rldataTeste$qualidade <- ifelse(rldataTeste$quality >= notaCorte, 1,ifelse(rldataTeste$quality < notaCorte, 0,0))
Realizando a regressão logística com todas as variáveis
x <- glm(qualidade~fixedacidity+volatileacidity+citricacid+residualsugar+chlorides+freesulfurdioxide+totalsulfurdioxide+density+pH+sulphates+alcohol, data=rldataTreino)
Analisando o summary
summary(x)
##
## Call:
## glm(formula = qualidade ~ fixedacidity + volatileacidity + citricacid +
## residualsugar + chlorides + freesulfurdioxide + totalsulfurdioxide +
## density + pH + sulphates + alcohol, data = rldataTreino)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.67232 -0.24878 -0.09711 0.06004 1.07338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.746e+01 1.365e+01 6.409 1.67e-10 ***
## fixedacidity 6.943e-02 1.407e-02 4.934 8.45e-07 ***
## volatileacidity -2.624e-01 6.788e-02 -3.866 0.000113 ***
## citricacid -3.105e-03 5.996e-02 -0.052 0.958709
## residualsugar 3.853e-02 5.178e-03 7.441 1.27e-13 ***
## chlorides -1.514e-01 3.476e-01 -0.436 0.663124
## freesulfurdioxide 1.426e-03 5.362e-04 2.659 0.007884 **
## totalsulfurdioxide -1.137e-04 2.358e-04 -0.482 0.629779
## density -9.036e+01 1.383e+01 -6.533 7.48e-11 ***
## pH 4.303e-01 6.854e-02 6.278 3.90e-10 ***
## sulphates 2.974e-01 6.112e-02 4.865 1.20e-06 ***
## alcohol 3.641e-02 1.730e-02 2.105 0.035377 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.139463)
##
## Null deviance: 558.42 on 3264 degrees of freedom
## Residual deviance: 453.67 on 3253 degrees of freedom
## AIC: 2847.7
##
## Number of Fisher Scoring iterations: 2
Descobertas:
Descartando variáveis não significativas
stepwise <- step(x,direction="both")
## Start: AIC=2847.74
## qualidade ~ fixedacidity + volatileacidity + citricacid + residualsugar +
## chlorides + freesulfurdioxide + totalsulfurdioxide + density +
## pH + sulphates + alcohol
##
## Df Deviance AIC
## - citricacid 1 453.67 2845.7
## - chlorides 1 453.70 2845.9
## - totalsulfurdioxide 1 453.71 2846.0
## <none> 453.67 2847.7
## - alcohol 1 454.29 2850.2
## - freesulfurdioxide 1 454.66 2852.8
## - volatileacidity 1 455.76 2860.7
## - sulphates 1 456.97 2869.4
## - fixedacidity 1 457.07 2870.1
## - pH 1 459.17 2885.1
## - density 1 459.62 2888.3
## - residualsugar 1 461.40 2900.8
##
## Step: AIC=2845.74
## qualidade ~ fixedacidity + volatileacidity + residualsugar +
## chlorides + freesulfurdioxide + totalsulfurdioxide + density +
## pH + sulphates + alcohol
##
## Df Deviance AIC
## - chlorides 1 453.70 2843.9
## - totalsulfurdioxide 1 453.71 2844.0
## <none> 453.67 2845.7
## + citricacid 1 453.67 2847.7
## - alcohol 1 454.29 2848.2
## - freesulfurdioxide 1 454.66 2850.8
## - volatileacidity 1 455.82 2859.1
## - sulphates 1 456.98 2867.5
## - fixedacidity 1 457.11 2868.4
## - pH 1 459.20 2883.3
## - density 1 459.64 2886.4
## - residualsugar 1 461.41 2898.9
##
## Step: AIC=2843.95
## qualidade ~ fixedacidity + volatileacidity + residualsugar +
## freesulfurdioxide + totalsulfurdioxide + density + pH + sulphates +
## alcohol
##
## Df Deviance AIC
## - totalsulfurdioxide 1 453.73 2842.2
## <none> 453.70 2843.9
## + chlorides 1 453.67 2845.7
## + citricacid 1 453.70 2845.9
## - alcohol 1 454.32 2846.4
## - freesulfurdioxide 1 454.68 2849.0
## - volatileacidity 1 455.89 2857.7
## - sulphates 1 457.04 2865.9
## - fixedacidity 1 457.35 2868.1
## - pH 1 459.57 2883.9
## - density 1 460.00 2886.9
## - residualsugar 1 461.97 2900.9
##
## Step: AIC=2842.18
## qualidade ~ fixedacidity + volatileacidity + residualsugar +
## freesulfurdioxide + density + pH + sulphates + alcohol
##
## Df Deviance AIC
## <none> 453.73 2842.2
## + totalsulfurdioxide 1 453.70 2843.9
## + chlorides 1 453.71 2844.0
## + citricacid 1 453.73 2844.2
## - alcohol 1 454.32 2844.4
## - freesulfurdioxide 1 454.96 2849.0
## - volatileacidity 1 456.14 2857.4
## - sulphates 1 457.05 2863.9
## - fixedacidity 1 457.47 2867.0
## - pH 1 459.70 2882.8
## - density 1 460.71 2890.0
## - residualsugar 1 462.55 2903.0
y <- lm(qualidade ~ fixedacidity + volatileacidity + residualsugar + freesulfurdioxide + density + pH + sulphates, data=rldataTreino, family = "binomial")
## Warning: In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
## extra argument 'family' will be disregarded
summary(y)
##
## Call:
## lm(formula = qualidade ~ fixedacidity + volatileacidity + residualsugar +
## freesulfurdioxide + density + pH + sulphates, data = rldataTreino,
## family = "binomial")
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68649 -0.25037 -0.09891 0.06131 1.07570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.152e+02 4.565e+00 25.237 < 2e-16 ***
## fixedacidity 9.068e-02 9.820e-03 9.234 < 2e-16 ***
## volatileacidity -2.482e-01 6.426e-02 -3.863 0.000114 ***
## residualsugar 4.812e-02 2.663e-03 18.068 < 2e-16 ***
## freesulfurdioxide 1.220e-03 4.243e-04 2.874 0.004074 **
## density -1.184e+02 4.690e+00 -25.252 < 2e-16 ***
## pH 5.249e-01 5.223e-02 10.049 < 2e-16 ***
## sulphates 3.349e-01 5.805e-02 5.769 8.7e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3735 on 3257 degrees of freedom
## Multiple R-squared: 0.1864, Adjusted R-squared: 0.1847
## F-statistic: 106.6 on 7 and 3257 DF, p-value: < 2.2e-16
Validando o ajuste do modelo
z <- predict(y, newdata=rldataTeste, type='response')
z <- ifelse(z > 0.5,1,0)
erro <- mean(z != rldataTeste$qualidade)
print(paste('Acurácia',1-erro))
## [1] "Acurácia 0.810165339865279"
Descobertas:
Conclusão:
Baseado nas métricas de acurácia e área sob a curva ROC podemos chegar a conclusão que a técnica Random Forest (árvore de decisão) obteve melhores resultados nesse dataset.
quais outras técnicas supervisionadas vocês indicariam como adequadas para esta análise?
Como técnica não supervisionada, vamos testar se o algoritmo de clusterização será adequado para agrupar dois conjunto de vinhos, categorizando-os como vinhos bons e vinhos ruins.
A variável ‘quality’, que identifica a nota do vinho, será a variável utilizada para correlacionar com as demais variáveis para identificar se existe algum agrupamento entre os vinhos.
Plotandos as relações das variáveis dos vinhos brancos:
plot(vw)
Clusterizando as variáveis:
Abaixo está sendo criada uma função para testar a variância dos dados em relação ao número de clusters:
elbow <- function(dataset){
wss <- numeric(15)
for (i in 1:15)
wss[i] <- sum(kmeans(dataset,centers=i,
nstart=100)$withinss)
plot(1:15, wss, type="b", main="Elbow method",
xlab="Number of Clusters",
ylab="Within groups sum of squares",
pch=8, col="red")
}
elbow(vw)
Observações:
Conforme é possível identificar, o plot está mostrando que para o dataset ‘vw’, o mais recomendado é utilizar 2 clusters.
Criando um cluster em 2 grupos para distinguir vinhos bons de vinhos ruins:
#Cluster WHITE:
set.seed(10)
modelo_white = kmeans(vw, centers = 2)
plot(vw, col=modelo_white$cluster)
points(modelo_white$centers, col = 4:1, bg = 1:4, pch = 24, cex=1, lwd=1)
Observações sobre o plot:
Analisando as relações, é possível identificar que as variáveis que o cluster conseguiu agrupar melhor em dois grupos foram as variáveis ‘totalsulfurdioxide’ e ‘freesulfurdioxide’ em relação a variável ‘quality’.
Analisando as variáveis ‘totalsulfurdioxide’ e ‘freesulfurdioxide’ separadamente:
vw%>%
select(totalsulfurdioxide, freesulfurdioxide, quality) -> vinhos_white.r2
plot(vinhos_white.r2)
Clusterizando as variáveis:
set.seed(10)
modelo_vinhos_white.r2 = kmeans(vinhos_white.r2,
centers = 2) #utilizando 2 clusters
plot(vinhos_white.r2,
col=modelo_vinhos_white.r2$cluster)
points(modelo_vinhos_white.r2$centers, col = 4:1, bg = 1:4, pch = 24, cex=1, lwd=1)
- Analisando o agrupamento, pode-se identificar que o cluster separou os vinhos em 2 grupos onde:
‘totalsulfurdioxide’ e ‘quality’: No grupo vermelho, ficaram os vinhos com ‘totalsulfurdioxide’ em média abaixo de 150 e os pretos acima de 150. ‘freesulfurdioxide’ e ‘quality’: No grupo vermelho, ficaram os vinhos com ‘freesulfurdioxide’ em média abaixo de 50 e os pretos acima de 50. Sendo que os grupos se juntão um pouco.
O grupo formado em ‘totalsulfurdioxide’ segue um agrupamento mais forte do que a variável ‘freesulfurdioxide’.
Conclusão sobre a técnica:
O algoritmo de cluster não é uma técnica muito adequada para agrupar os vinhos em categorias de vinhos bons e ruins.
O agrupamento realizado pelo algoritmo não classifica com muita precisão os vinhos de acordo com a variável ‘quality’. Podemos identificar que o grupo está distribuido em todas as notas de vinhos. Na nota 8, o grupo vermelho está em maior quantidade do que os pretos, na nota 3 para a variável ‘freesulfurdioxide’, o grupo preto está em maior quantidade em relação ao grupo vermelho.